Introduction

Suicide, the act of intentionally causing one’s own death, is one of the leading causes of death. As society develops at an increasingly rapid rate, more suicides occur amid rising stress and mental disorders such as depression and anxiety disorders.

To build a system that analyzes suicide rates across countries, age groups, and sexes, classifies the data, and predicts future rates, we applied statistical learning techniques to a dataset containing suicide data from 2000 to 2016. The results show potential for such a system to analyze the relationship between age, sex, country GDP per capita, and the suicide rate, and then to predict the rate in each country, especially given that the data originate from the official WHO dataset. The results indicate that this prediction can be made with a reasonably small amount of error. However, practical and statistical limitations suggest the need for further investigation.


Methods

Data

This Kaggle[^1] dataset was compiled from four other datasets: United Nations Development Program[^2], World Bank[^3], Suicide in the Twenty-First Century[^4], and World Health Organization[^5]. It summarizes a set of potential factors that influence suicide rates across the world.

There are a total of 12 variables with 27,820 observations.

Attribute information is listed below. The target variable is suicide/100k pop.

*See Appendix for variable description

  • country
  • year
  • sex
  • age
  • suicides_no
  • population
  • suicide/100k pop
  • country-year
  • HDI for year
  • gdp_for_year ($)
  • gdp_per_capita ($)
  • generation

The variable suicide/100k pop is calculated from suicides_no and population, so both of those variables are removed to avoid collinearity. The variable country-year duplicates country and year and is removed as well. gdp_for_year ($) is dropped because it is directly linked to gdp_per_capita ($), and generation is dropped because it is an approximate proxy for age. The final dataset therefore includes 7 variables.

For classification, the suicide rate was binned into three levels: “High”, “Medium”, and “Low”. Observations above the 75% quantile of the suicide rate are classified as “High”, observations between the 25% and 75% quantiles are classified as “Medium”, and observations below the 25% quantile are classified as “Low”.
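
The quantile-based labeling described above can be sketched as follows. This is a minimal Python illustration, not the code used for this report: the function name `label_rates`, the example rates, and the choice to assign values exactly on a threshold to “Medium” are all assumptions.

```python
from statistics import quantiles

def label_rates(rates):
    """Label each rate Low / Medium / High using the 25% and 75% quantiles."""
    # quantiles(..., n=4) returns the three quartile cut points (Q1, median, Q3).
    q1, _, q3 = quantiles(rates, n=4, method="inclusive")
    labels = []
    for rate in rates:
        if rate > q3:
            labels.append("High")
        elif rate < q1:
            labels.append("Low")
        else:
            # Assumption: values exactly on a threshold count as Medium.
            labels.append("Medium")
    return labels

# Hypothetical suicide rates per 100k population, for illustration only.
example_rates = [0.5, 1.5, 3.0, 4.5, 7.0, 10.0, 12.5, 20.0]
example_labels = label_rates(example_rates)
print(list(zip(example_rates, example_labels)))
```

Because the thresholds are the quartiles of the rate itself, roughly a quarter of the observations land in “Low”, a quarter in “High”, and half in “Medium”.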

We also checked whether the classes are imbalanced. The counts below match the roughly 25/50/25 split implied by the quantile thresholds, so we conclude that class imbalance is not a problem for this dataset.

Table: Check data imbalance

| Fatal_rate | Total Number |
|---|---|
| High | 3240 |
| Low | 3253 |
| Medium | 6441 |
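
As a quick check of the class balance, the counts in the table above can be converted to shares. The snippet below is a small Python sketch using only those three counts; the variable names are illustrative.

```python
# Class counts copied from the table above.
class_counts = {"High": 3240, "Low": 3253, "Medium": 6441}

total = sum(class_counts.values())
class_shares = {label: count / total for label, count in class_counts.items()}

for label, share in class_shares.items():
    print(f"{label}: {share:.1%}")
```

The shares come out close to one quarter, one quarter, and one half, which is what the quantile-based class definition implies.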

In preparation for model training, a training dataset is created using 80% of the provided data.
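
One way to hold out data like this is a simple shuffled split. The sketch below is in Python and assumes a plain random split with a fixed seed; the report does not say whether the split was stratified, and `train_test_split` here is a hypothetical helper, not a library call.

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    """Shuffle the rows and cut them into a training part and a test part."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the observations: 100 hypothetical row identifiers.
rows = list(range(100))
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))
```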

Modeling

Two regression models and four classification models were trained, each using 5-fold cross-validation. The best model was then chosen separately for regression and for classification.
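
For readers unfamiliar with 5-fold cross-validation, the following Python sketch shows how row indices can be divided into five folds, with each fold held out once for validation. The function `k_fold_indices` is hypothetical, and the round-robin fold assignment is a simplification; real tooling usually shuffles the rows first.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, validation_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    # Assign rows to folds round-robin; a real run would shuffle first.
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, validation_idx

splits = list(k_fold_indices(10, k=5))
for train_idx, validation_idx in splits:
    print("train:", train_idx, "validate:", validation_idx)
```

Each model is fit on the training indices and scored on the held-out fold, and the five scores are averaged to give the cross-validated estimate.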

Regression Models

  • A linear regression model using all predictors, fit to the training dataset.

  • A k-nearest neighbors model using all predictors, fit to the training dataset.
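
To make the second model concrete, here is a minimal Python sketch of k-nearest neighbors regression: the prediction for a query point is the mean target of its k closest training points. It assumes purely numeric features and Euclidean distance; categorical predictors such as country or sex would need to be encoded first, which is not shown. The function name and data are hypothetical.

```python
import math

def knn_predict(train_x, train_y, query, k=3):
    """Predict a numeric target as the mean of the k nearest training targets."""
    # Pair every training target with its Euclidean distance to the query.
    distances = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    nearest = distances[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Hypothetical one-feature training data, for illustration only.
train_x = [(0.0,), (1.0,), (2.0,), (10.0,)]
train_y = [1.0, 2.0, 3.0, 50.0]
prediction = knn_predict(train_x, train_y, query=(1.0,), k=3)
print(prediction)
```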

Classification Models

  • A random forest model with out-of-bag (OOB) resampling

  • A boosted model with cross-validation resampling

  • A multinomial model with cross-validation resampling

  • A neural network model with cross-validation resampling

Model selection and evaluation are discussed in the Results section.


Results

For the regression models, the table below shows the RMSE of the predicted suicide rates for the two models on the training dataset. The linear regression model has the lower RMSE and is therefore chosen; on the test dataset it obtains an RMSE of 12.488.

Table: The Training RMSEs of the Two Regression Models

| Model | Training RMSE | R-squared | MAE |
|---|---|---|---|
| Linear Regression | 12.476 | 0.526 | 8.242 |
| KNN Model | 16.624 | 0.181 | 10.069 |
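
The three metrics in the table above can be written out as follows. This is a Python sketch with hypothetical data. It assumes R-squared is the coefficient of determination, 1 minus the ratio of residual to total sum of squares; some tools instead report the squared correlation between predictions and observations, and the report does not say which definition was used.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: penalizes large errors more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: the average size of the error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Coefficient of determination (assumed definition, see lead-in)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical observed and predicted rates, for illustration only.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [3.0, 3.0, 7.0, 7.0]
print(rmse(actual, predicted), mae(actual, predicted), r_squared(actual, predicted))
```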

For classification, the table below shows the highest accuracy achieved by each model on the training dataset, followed by confusion matrices. Intermediate tuning results can be found in the appendix. Due to computational limitations, only three confusion matrices are presented; the random forest model, which gives the best result, is not among them. The random forest model is chosen because it has the highest accuracy, noticeably above that of the other models.

Models were tuned for accuracy, but sensitivity was also considered when choosing a final model. Aside from the random forest model, all models had similar performance.

Table: Model Accuracy

| Model | Accuracy |
|---|---|
| Random Forest | 0.876 |
| Boosted Model | 0.828 |
| Multinomial | 0.837 |
| Neural Network | 0.769 |

Table: Multiclass boosted, Cross-Validated Multiclass Predictions versus Multiclass Response, Percent

| Prediction | Truth: High | Truth: Low | Truth: Medium |
|---|---|---|---|
| High | 20.458 | 0.951 | 3.564 |
| Low | 0.448 | 20.118 | 4.036 |
| Medium | 4.144 | 4.082 | 42.199 |

Table: Multiclass multinomial, Cross-Validated Multiclass Predictions versus Multiclass Response, Percent

| Prediction | Truth: High | Truth: Low | Truth: Medium |
|---|---|---|---|
| High | 20.566 | 0.804 | 3.139 |
| Low | 0.464 | 20.520 | 4.005 |
| Medium | 4.020 | 3.827 | 42.655 |

Table: Multiclass neural network, Cross-Validated Multiclass Predictions versus Multiclass Response, Percent

| Prediction | Truth: High | Truth: Low | Truth: Medium |
|---|---|---|---|
| High | 19.337 | 0.503 | 4.059 |
| Low | 0.340 | 16.561 | 4.701 |
| Medium | 5.373 | 8.087 | 41.039 |
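
The accuracies reported for these three models can be recovered from the confusion matrices above, since accuracy is the share of observations on the diagonal. The Python sketch below copies the percentages from the tables; the helper name `accuracy_from_percent` is illustrative.

```python
# Cross-validated confusion matrices in percent, copied from the tables above.
# Rows are predictions, columns are true classes, both ordered High, Low, Medium.
cv_confusion = {
    "Boosted Model": [
        [20.458, 0.951, 3.564],
        [0.448, 20.118, 4.036],
        [4.144, 4.082, 42.199],
    ],
    "Multinomial": [
        [20.566, 0.804, 3.139],
        [0.464, 20.520, 4.005],
        [4.020, 3.827, 42.655],
    ],
    "Neural Network": [
        [19.337, 0.503, 4.059],
        [0.340, 16.561, 4.701],
        [5.373, 8.087, 41.039],
    ],
}

def accuracy_from_percent(matrix):
    """Accuracy is the diagonal mass divided by the total mass."""
    diagonal = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return diagonal / total

cv_accuracy = {name: accuracy_from_percent(m) for name, m in cv_confusion.items()}
for name, acc in cv_accuracy.items():
    print(f"{name}: {acc:.3f}")
```

In all three matrices most of the off-diagonal mass involves the “Medium” class, while confusion between “High” and “Low” is rare.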

On the test data, the confusion matrix shows high accuracy: the diagonal entry is the largest in each row.

Table: Test Results, Random Forest Classification

| | Truth: High | Truth: Medium | Truth: Low |
|---|---|---|---|
| Predicted: High | 704 | 13 | 84 |
| Predicted: Medium | 11 | 659 | 140 |
| Predicted: Low | 85 | 122 | 1416 |

Discussion

Although our regression model did not meet the expected outcomes, we believe this analysis demonstrates a proof of concept for a suicide rate prediction system. With more data, both samples and features, this model could likely be improved before being put into practice.

Below is the plot of predicted suicide rate by age using the linear regression model.

Below is the plot of predicted suicide rate by age using the KNN model.

The RMSE of the KNN model is higher than that of the linear regression model, and its plot shows an obvious spread of data points. Both graphs show that people aged 75 or above tend to have a higher suicide rate than other age groups. This finding could point to social issues. For example, do we care enough for the elderly in our families? Do we spend enough time with them and listen to their needs? It is an alarm bell for governments and individuals to pay more attention to older people. Governments could consider offering more social benefits to the elderly, and families should give elderly relatives more time and care.

When choosing the metric to assess model performance, we considered accuracy, specificity, and sensitivity. We decided to use accuracy because we are most concerned with how closely the classified suicide rate level matches the true level. In other settings, such as credit card fraud classification (genuine versus fraud), we might care more about sensitivity or specificity. There, the most severe error is a fraudulent card classified as genuine, which delays the bank’s response and can cause the cardholder a large loss. Suicide rate classification is different. Since our goal is to provide a suicide rate reference for each country, misclassifying any level leads to a misinterpretation of that country’s rate, so no single type of error is worse than the others. We therefore chose overall accuracy as the metric for evaluating our models.
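
To make the difference between these metrics concrete, the Python sketch below computes per-class sensitivity and specificity from a multiclass confusion matrix by treating each class as “positive” in turn. The counts are hypothetical and the function name `one_vs_rest_rates` is illustrative; it is not taken from the analysis behind this report.

```python
def one_vs_rest_rates(matrix, labels):
    """Per-class sensitivity and specificity from a confusion matrix.

    matrix[i][j] counts cases predicted as labels[i] whose true class is labels[j].
    """
    total = sum(sum(row) for row in matrix)
    rates = {}
    for k, label in enumerate(labels):
        tp = matrix[k][k]
        fn = sum(matrix[i][k] for i in range(len(labels))) - tp  # true k, predicted other
        fp = sum(matrix[k]) - tp                                  # predicted k, true other
        tn = total - tp - fn - fp
        rates[label] = {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (tn + fp),
        }
    return rates

# Hypothetical counts, for illustration only.
labels = ["High", "Low", "Medium"]
matrix = [
    [8, 0, 2],
    [0, 9, 1],
    [2, 1, 17],
]
rates = one_vs_rest_rates(matrix, labels)
accuracy = sum(matrix[i][i] for i in range(3)) / sum(sum(row) for row in matrix)
print(accuracy, rates)
```

Accuracy summarizes all classes in a single number, whereas sensitivity and specificity are reported per class and would matter more if one kind of error were costlier than another.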

We found that the random forest classification results are the best, and the visualization below shows a direct comparison between actual and predicted values. Correctly classified observations are shown in blue and misclassified observations in red. The majority of the observations are blue, indicating a relatively high accuracy rate (87.04%). We conclude that the random forest model performed well with the current data. As more data become available, this model could be improved before being put into practice.


Appendix

Data Dictionary

  • country: Name of the country
  • year: Year the suicide rate refers to
  • sex: Sex of the deceased
  • age: Age group of the deceased
  • suicides_no: Number of suicides
  • population: Country population
  • suicide/100k pop: Number of suicides per 100,000 population (the suicide rate)
  • country-year: Country and year
  • HDI for year: The Human Development Index, a composite statistical index of life expectancy, education, and per capita income indicators
  • gdp_for_year ($): Country GDP for the year
  • gdp_per_capita ($): Country’s gross domestic product divided by its total population
  • generation: The generation range of the deceased

For additional information, see documentation on Kaggle.[^6]

EDA

Table: Summary Statistics

| Country | Sex | Age | Mean Suicide Rate |
|---|---|---|---|
| Albania | female | 15-24 years | 4.015 |
| Albania | female | 25-34 years | 2.644 |
| Albania | female | 35-54 years | 2.164 |
| Albania | female | 5-14 years | 0.303 |
| Albania | female | 55-74 years | 1.567 |
| Albania | female | 75+ years | 3.806 |

Table: Summary Statistics

| Country | GDP per Year | GDP per Capita | Mean Suicide Rate |
|---|---|---|---|
| Albania | 5211.661 | 1859 | 3.503 |
| Antigua and Barbuda | 803.545 | 10448 | 0.553 |
| Argentina | 274256.498 | 7914 | 10.469 |
| Armenia | 5386.592 | 1873 | 3.276 |
| Aruba | 2196.223 | 24221 | 9.503 |
| Australia | 632750.138 | 32776 | 12.993 |



  1. Kaggle: Suicide Rates Overview 1985 to 2016

  2. United Nations Development Program. (2018). Human development index (HDI). Retrieved from http://hdr.undp.org/en/indicators/137506

  3. World Bank. (2018). World development indicators: GDP (current US$) by country: 1985 to 2016. Retrieved from http://databank.worldbank.org/data/source/world-development-indicators#

  4. [Szamil]. (2017). Suicide in the Twenty-First Century [dataset]. Retrieved from https://www.kaggle.com/szamil/suicide-in-the-twenty-first-century/notebook

  5. World Health Organization. (2018). Suicide prevention. Retrieved from http://www.who.int/mental_health/suicide-prevention/en/